Methods, systems, and circuits for text independent speaker recognition with automatic learning features

ABSTRACT

Embodiments provide a method and system of text independent speaker recognition with a complexity comparable to a text dependent version. The scheme exploits the fact that speech is a quasi-stationary signal and simplifies the recognition process based on this property. The modeling allows the speaker profile to be updated progressively with new speech samples acquired during usage.

PRIORITY CLAIM

This application claims priority to provisional patent application No. 61/749,094, filed 4 Jan. 2013, which is incorporated herein by reference.

TECHNICAL FIELD

The present disclosure relates generally to speech recognition systems, and more specifically to time scalable text independent speaker recognition with automatic learning features.

BACKGROUND

Speaker recognition is a biometric modality that uses a person's voice for identification purposes [1]. It is different from speech recognition, where a transcription of the spoken words is desired instead of the identity of the speaker. Before the system can be used, it must go through a training procedure in which speech samples from the user are introduced to the system to build its speaker model. In the testing phase, the user's speech is compared against the speaker model database and the model that most closely resembles it is identified.

There are two main types of speaker recognition.

Text dependent speaker recognition

In this mode, the user needs to utter a specific keyword or phrase to be accepted by the system. It can be either a password or a prompted phrase. The application here has strong control over user input, and it needs user cooperation to perform the identification process.

Most text dependent speaker recognition systems use the Hidden Markov Model (HMM), which provides a statistical representation of the individual's speech. FIG. 1 shows the structure of an HMM. The training process requires the user to say the phrase numerous times to build the statistical model.

Another method is template matching, where a sequence of feature vectors is built from the fixed phrase. Verification can be done using Dynamic Time Warping (DTW) to measure the similarity between the test phrase and the template [2].

Text independent speaker recognition

This working mode allows the user to say anything he or she wants, and the system should be able to perform the identification. Sufficient speech samples are needed to make an accurate recognition, and the system has no knowledge of what is being spoken. The advantage is that the process can be done without the user's cooperation. Various forms of neural networks can be trained to perform this task, but the de facto reference method is the Gaussian Mixture Model (GMM) used to represent the speaker model. Usually these GMMs are adapted from a Universal Background Model (UBM) using an adaptation method such as maximum a posteriori (MAP). In recent years, the Support Vector Machine (SVM) has been found to be the most robust classifier in speaker verification, and its combination with GMM has successfully increased the accuracy of such systems. Please refer to [3] for a more complete overview of text independent speaker recognition. The National Institute of Standards and Technology periodically holds a Speaker Recognition Evaluation to address various research problems and gauge the technology development related to this field.

There is a need for improved methods and circuits for voice recognition in lower power applications, such as in battery powered devices like smartphones.

SUMMARY

A text dependent system has lower complexity than a text independent version because of the system's knowledge of the phrase that is going to be used for identification. However, text independent recognition is more desirable due to its flexibility and unobtrusiveness. Many methods have been proposed to reduce the complexity of such systems, for example by using pre-quantization or speaker pruning [5]. This invention seeks to address the complexity issue of text independent speaker recognition by imitating the concept of the text dependent version.

FIG. 2 shows the training process of text dependent speaker recognition. Multiple utterances of the keyword or phrase are needed to build a reliable Markov chain. For illustration purposes, in FIG. 2 the keyword “apple” is represented by a 3-state model. The training process searches for the transition probabilities between states (a11, a12, etc.) and the state statistics themselves. Each state is usually represented by either a single Gaussian with a full covariance matrix, or a Gaussian mixture with diagonal covariance matrices. In FIG. 2, b1 to b3 represent the Gaussian mixture parameters, which consist of the means, variances, and mixture weights. The training process to find these parameters is beyond the scope of this document, but the final result should maximize the likelihood of the trained word being generated by the model.

In this invention, we try to build text independent speaker recognition using the same structure as a text dependent HMM. Unfortunately, in this case we do not have the user's cooperation during testing, so the user can say any word. It is also impossible to train the Markov chain discussed earlier because there is no transcribed speech to train it. This invention solves this problem by modelling a user's speech into smaller units. This is the same modelling strategy used in [6]. A user's speech is clustered, without the need for transcription, into 64 distinct units that represent the acoustic space of the speaker, or in other words, the speaker model. Each of these units can be thought of as one state of the Markov chain, even though there is no knowledge of what word it would represent. The second problem is to construct the transition probabilities between states. It is impossible to find the transition probabilities without the context of the word or phrase being modelled. This invention solves this by exploiting the fact that speech is a quasi-stationary signal. Following this, the transition probability of a state to itself is set higher than that of a state to other states. These numbers are uniform for all the states in the speaker model. FIG. 3 illustrates a subset of this speaker model. Contrary to FIG. 2, there is no association of each cluster with a particular sound or phoneme.
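For concreteness, the following C sketch shows one possible data layout for such a speaker model. Only the 64 clusters and the two uniform transition constants come from the description above; the struct names, the number of mixture components, and the feature dimension (matching the 24 MFCC used later) are illustrative assumptions.

#define NUM_CLUSTERS  64   /* acoustic-space clusters per speaker model   */
#define NUM_MIX        8   /* Gaussian components per cluster (assumed)   */
#define FEAT_DIM      24   /* MFCC dimension, per the front end below     */

typedef struct {
    double weight[NUM_MIX];           /* mixture weights w_m              */
    double mean[NUM_MIX][FEAT_DIM];   /* component means mu_m             */
    double var[NUM_MIX][FEAT_DIM];    /* diagonal variances sigma_m^2     */
} Cluster;

typedef struct {
    Cluster cluster[NUM_CLUSTERS];    /* the 64 units of the acoustic space   */
    double  log_self_trans;           /* uniform log P(stay in same cluster)  */
    double  log_trans;                /* uniform log P(move to other cluster) */
} SpeakerModel;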

Text dependent systems rely on the user saying a predetermined or prompted phrase to decide the identity of the speaker. Such a system has prior knowledge of how the user would say that particular phrase, and this knowledge is generally captured using a hidden Markov model. Text independent speaker recognition, on the other hand, collects samples of the user's voice, in which the user can say anything he or she wants. The system then builds a profile from the collected speech and compares it to the list of profiles available in its database. This task is more computationally intensive, as the system is required to process significantly more speech samples compared to its text dependent counterpart. However, the advantage of the text independent version is that recognition can be done without the user's cooperation and in a very discreet manner. This embodiment provides a new method of text independent speaker recognition with a complexity comparable to a text dependent version. The scheme exploits the fact that speech is a quasi-stationary signal and simplifies the recognition process based on this property. The modelling allows the speaker profile to be updated progressively with new speech samples acquired during usage.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows the structure of a Hidden Markov Model.

FIG. 2 shows the training process of text dependent speaker recognition, where multiple utterances of the word “apple” are needed to construct the statistics of the Hidden Markov Model.

FIG. 3 shows a subset of the speaker model used in this embodiment, showing only 4 clusters.

FIG. 4 shows the block diagram of the process to build a speaker profile.

FIG. 5 shows an example of the template path for the text dependent version.

FIG. 6 shows the best path probability (highest loglk value) for each speaker.

FIG. 7 shows the working of the embodiment, where the path of a speaker will be closer to the GMM means than that of a non-speaker, resulting in a higher log likelihood value.

FIG. 8 shows the block diagram to extract the MFCC, which are the features used in this embodiment.

FIG. 9 shows the triangular filter used on the PSD during MFCC extraction to obtain each Mel frequency bin value.

FIG. 10 shows the decoding process per audio frame that has to be calculated per speaker model.

DETAILED DESCRIPTION

The embodiment provides a new method to reduce the complexity of a text independent speaker recognition system by reusing the structure of the text dependent version. Instead of comparing the input speech with a template path, only the probability of each cluster is calculated. The transition probability is substituted with two constants representing the transition to the same cluster and to a different cluster. With this embodiment, only the log likelihood (loglk) variable per cluster needs to be stored at each iteration, that is, whenever a new frame of speech is received. No other information needs to be retained for future processing, which results in memory savings. Contrary to conventional methods of text independent speaker recognition, which require enough speech samples before the system is able to make a decision, this embodiment enables the system to make the decision at any point in time by looking at the current loglk value. The confidence of the result can be set proportional to the difference in likelihood between the winning speaker and the next highest result. When confidence is high, the speech is automatically used to improve the speaker profile by performing a further training process.

In the following description, certain details are set forth in conjunction with the described embodiments to provide a sufficient understanding of the embodiment. One skilled in the art will appreciate, however, that the embodiment may be practiced without these particular details. Furthermore, one skilled in the art will appreciate that the example embodiments described below do not limit the scope of the present disclosure, and will also understand that various modifications, equivalents, and combinations of the disclosed embodiments and components of such embodiments are within the scope of the present disclosure. Embodiments including fewer than all the components of any of the respective described embodiments may also be within the scope of the present disclosure although not expressly described in detail below. Finally, the operation of well-known components and/or processes has not been shown or described in detail below to avoid unnecessarily obscuring the present embodiment.

This embodiment can be combined with various front end speech processing methods, as it is independent of the features used for the recognition process. In this section, it will be demonstrated how this works with a simple front end using MFCC as features.

The training part of this innovation consists of building the acoustic space of the speaker model. FIG. 4 illustrates this process. The speech data are processed per frame, and the Mel Frequency Cepstral Coefficients (MFCC) are extracted. These MFCC coefficients are then clustered into 64 groups depending on their Euclidean distance. Any clustering method can be applied for this purpose, one of which is the K-means clustering method. Each of these clusters is then modelled using a Gaussian mixture, and the results constitute the speaker profile. This profile is built for each registered speaker.

During usage time, speech from an unknown speaker is received and the system will choose which of the profiles that speech most likely belongs to. This invention tests each of the speaker profiles and finds the probability through the following steps (note that this processing is done per frame):

The MFCC of the speech signal is extracted.

The probability of that frame belonging to each of the clusters is calculated.

The final probability is calculated by taking into account the transition from the previous cluster.

The process is repeated for the next frame.

This process is similar to text dependent speaker recognition, but since there is no fixed word or phrase for the recognition, the template path sequence between clusters does not exist. FIG. 5 shows an example of the template path for the text dependent version. This path represents certain words being said, and for the text dependent version, this path will be compared with the speech from the unknown speaker. In this innovation, the text independent speaker recognition does not compare against any path, but only generates the best sequence which results in the highest probability value. There is one issue that must be dealt with, namely the transition probability. The transition probabilities of a template path are generated from multiple utterances of the same word, which are not available for the text independent version. This innovation relies on only two transition probability values: the transition to itself, and the transition to another cluster. Since speech is a quasi-stationary signal, the value of the former is set much higher than the latter. The following pseudo code explains the steps in deriving the final probability value:

/*---- Initialisation ----*/
T = number of frames of test speech sample;
for (i = 0; i < T; i++)
    for (j = 0; j < Number_of_cluster; j++)
        loglk[i][j] = -1000000000000000000.0;
for (i = 0; i < Number_of_cluster; i++) {
    likelihood = GMM_likelihood(input, i, 0, speaker);
    loglk[0][i] = logx(likelihood);
}

/*---- Recursion ----*/
for (t = 1; t < T; t++) {
    for (j = 0; j < Number_of_cluster; j++) {
        maximum_value = -1000000000000000000.0;  /* a very small number */
        for (i = 0; i < Number_of_cluster; i++) {
            if (i != j)   /* moving from cluster i to cluster j */
                value = loglk[t-1][i] + logx_transition;
            else          /* remaining in the same cluster */
                value = loglk[t-1][i] + logx_selfTransition;
            if (value > maximum_value)
                maximum_value = value;
        }
        likelihood = GMM_likelihood(input, j, t, speaker);
        loglk[t][j] = logx(likelihood) + maximum_value;
    }
}

/*---- Find highest log likelihood ----*/
Result = -1000000000000000000.0;
for (i = 0; i < Number_of_cluster; i++) {
    if (loglk[T-1][i] > Result)
        Result = loglk[T-1][i];
}

The function GMM_likelihood(input, j, t, speaker) computes the probability of the “input” frame at time “t” belonging to the “speaker” cluster number “j”. This is calculated per Gaussian component by

${f\left( {{x;\mu},\sigma^{2}} \right)} = {\frac{1}{\sigma \sqrt{2\pi}}e^{- \frac{{({x - \mu})}^{2}}{2\sigma_{2}}}}$

where x is the input signal (in this case, the MFCC of the input signal), and μ and σ² are the mean and variance of the Gaussian distribution. The results per component are finally combined as:

$GMM\_likelihood = \sum_{m = 0}^{M - 1} w_{m}\, f(x; \mu_{m}, \sigma_{m}^{2})$

where M is the number of mixtures and w_m is the mixture weight.
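The following C sketch is one minimal realization of these two equations, assuming the diagonal-covariance SpeakerModel layout sketched earlier; the simplified signature (passing a cluster pointer rather than the indices of the pseudo code) is also an assumption.

#include <math.h>

/* Per-component diagonal Gaussian density f(x; mu, sigma^2), taken as a
 * product over the FEAT_DIM dimensions of one feature vector and
 * accumulated in the log domain for numerical stability. */
static double gaussian_density(const double *x, const double *mean,
                               const double *var)
{
    double log_p = 0.0;
    for (int d = 0; d < FEAT_DIM; d++) {
        double diff = x[d] - mean[d];
        log_p += -0.5 * (log(2.0 * M_PI * var[d]) + diff * diff / var[d]);
    }
    return exp(log_p);
}

/* GMM_likelihood = sum over m of w_m * f(x; mu_m, sigma_m^2). */
double GMM_likelihood(const double *x, const Cluster *c)
{
    double p = 0.0;
    for (int m = 0; m < NUM_MIX; m++)
        p += c->weight[m] * gaussian_density(x, c->mean[m], c->var[m]);
    return p;
}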

The simplification in this method comes from the two variables “logx_transition” and “logx_selfTransition”, which contain the log likelihoods of the transition to a different state and to the same state, respectively. In text dependent speaker recognition, these values are square matrices representing the state transition probabilities derived during the training process with transcribed speech. Since we do not have that, the values used are logx(0.95) for the self transition and logx(0.05) for the other one, emphasizing the point that it should be more likely to maintain the current state.

The pseudo code above can be distinctly separated into initialization, processing, and ending stages. This is an advantage for real time applications, as the initialization and the ending stage only need to be performed once, that is, before and after the frame loop. The main loop (noted with the “recursion” comment) is what is performed on each frame. It utilizes the dynamic programming concept to find the most probable state sequence. The variable “loglk[t][j]” holds the best probability value at time t, calculated from t=0 and ending at state j. In an actual implementation, only a one dimensional variable loglk[j] is required, since there is no need to keep the values for times that have already passed. Each speaker has their own loglk[j] variable, and the speaker with the highest value is chosen as the output speaker.
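A C sketch of this one dimensional variant appears below, reusing the SpeakerModel and GMM_likelihood sketches from above; the scratch copy of the previous frame's values is an implementation detail assumed here.

#include <math.h>

/* One decoding step for a single speaker: update the 1-D loglk[] array
 * in place from the MFCC vector x of the current frame. */
void decode_frame(const SpeakerModel *s, const double *x,
                  double loglk[NUM_CLUSTERS])
{
    double prev[NUM_CLUSTERS];
    for (int j = 0; j < NUM_CLUSTERS; j++)
        prev[j] = loglk[j];

    for (int j = 0; j < NUM_CLUSTERS; j++) {
        /* Best predecessor: either stay in cluster j, or arrive from
         * the best other cluster. */
        double best = prev[j] + s->log_self_trans;
        for (int i = 0; i < NUM_CLUSTERS; i++) {
            if (i == j)
                continue;
            double v = prev[i] + s->log_trans;
            if (v > best)
                best = v;
        }
        loglk[j] = log(GMM_likelihood(x, &s->cluster[j])) + best;
    }
}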

The more speech samples are available, the more reliable the result is. FIG. 6 shows the best path probability for each speaker. The x axis is the timeline in terms of frames. The decision on the winning speaker can be made at any point in time. The further it progresses, the more distinct the difference between the speakers becomes, and the higher the confidence level in choosing the winning speaker. The difference between the best probability and the second best can be used as a security setting. For applications which are more tolerant to error, this threshold can be made small so the decision on the output speaker can be made more quickly. For applications requiring reliability, however, this threshold should be set higher such that the output speaker is decided only when the system is confident enough in the output (the difference from the second best speaker is high). A reset needs to be done after a period of time, and the most appropriate time to do it is after a sufficient period of silence (no speech) is observed.
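Such a decision rule might look as follows in C; the function name and the margin threshold parameter are assumptions for illustration.

#include <math.h>

/* Pick the winning speaker as the one with the highest best-path loglk,
 * and accept the decision only when its margin over the runner-up
 * exceeds the application-dependent threshold. Returns 1 when the
 * confidence is sufficient, 0 otherwise. */
int decide_speaker(const double loglk[][NUM_CLUSTERS], int num_speakers,
                   double margin_threshold, int *winner)
{
    double best = -INFINITY, second = -INFINITY;
    *winner = -1;
    for (int s = 0; s < num_speakers; s++) {
        double peak = -INFINITY;  /* highest loglk over this speaker's clusters */
        for (int j = 0; j < NUM_CLUSTERS; j++)
            if (loglk[s][j] > peak)
                peak = loglk[s][j];
        if (peak > best) {
            second = best;
            best = peak;
            *winner = s;
        } else if (peak > second) {
            second = peak;
        }
    }
    return (best - second) > margin_threshold;
}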

FIG. 7 illustrates the working of this scheme. The solid line path represents the speech of the correct speaker. It rests nearer to the black dots (representing the means of the Gaussian distributions), and hence results in a higher probability value compared to the non-speaker counterpart (the dashed line path). As mentioned before, the system does not care about what was being said, so the path can be immediately discarded after the probability value is calculated.

Detailed Description of Preferred Embodiments

This invention can be combined with various front end speech processing methods, as it is independent of the features used for the recognition process.

In this section, it will be demonstrated how this works with a simple front end using MFCC as features.

Front End Processing

The analysis for the speech and speaker recognition tasks is not done on the speech signal itself, but on features derived from it. All incoming speech signals go through this front end processing. FIG. 8 shows the block diagram to extract the MFCC, which are the features used in this invention.

Preemphasis

Preemphasis is done using the following equation

$S(t) = S_{in}(t) - 0.96\, S_{in}(t-1)$

where S_in is the input signal and t is the time index.
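In C, the filter amounts to a single pass over the sample buffer; the handling of the first sample is an assumption, since the equation leaves the boundary unspecified.

/* Preemphasis: S(t) = S_in(t) - 0.96 * S_in(t-1). */
void preemphasis(const float *s_in, float *s_out, int n)
{
    s_out[0] = s_in[0];   /* no previous sample at t = 0 */
    for (int t = 1; t < n; t++)
        s_out[t] = s_in[t] - 0.96f * s_in[t - 1];
}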

Framing

Processing is done per frame, where each frame is 320 samples in length and progresses by 160 samples to the next frame. At a 16 kHz sampling rate, this corresponds to 20 ms frames progressing by 10 ms each. The Hamming window is applied to each frame, following the equation below:

${w(n)} = {0.54 - {0.46{\cos \left( \frac{2\pi \; n}{N - 1} \right)}}}$

where N is the frame size (in this case 320) and n is the time index within the window.
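A C sketch of the framing and windowing step, using the 320-sample frame length and 160-sample hop given above:

#include <math.h>

#define FRAME_LEN 320   /* 20 ms at 16 kHz */
#define FRAME_HOP 160   /* 10 ms at 16 kHz */

/* Extract the frame starting at sample 'start' and apply the Hamming
 * window w(n) = 0.54 - 0.46 * cos(2*pi*n / (N-1)). */
void window_frame(const float *speech, int start, float frame[FRAME_LEN])
{
    for (int n = 0; n < FRAME_LEN; n++) {
        float w = 0.54f
                - 0.46f * cosf(2.0f * (float)M_PI * n / (FRAME_LEN - 1));
        frame[n] = speech[start + n] * w;
    }
}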

PSD

The Discrete Fourier Transform (DFT) is used to analyze the frequency content of the signal, and the Power Spectral Density (PSD) is calculated as follows:

${{PSD}(k)} = {{\sum\limits_{t = 0}^{{FFTlen} - 1}{{S_{win}(t)}.e^{{- {jtk}}\frac{2\pi}{FFTlen}}}}}^{2}$

MEL

The MEL warping is approximated by

$mel(f) = 2595\, \log_{10}\left(1 + \frac{f}{700}\right)$

A triangular filter is applied to the PSD as shown in FIG. 9. The output is calculated as

${{Mel}(i)} = {\sum\limits_{k = 0}^{{FFTlen}/2}{W_{i,k}{{PSD}(k)}}}$

where i is the Mel frequency bin and W is the weight of the triangular filter.
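Applying the filterbank then reduces to a weighted sum per Mel bin, as sketched below in C; the number of bins and the precomputation of the triangular weights W from the Mel warping above are assumptions.

#define NUM_MEL_BINS 26   /* number of triangular filters (assumed) */

/* Mel(i) = sum_k W[i][k] * PSD(k), where W holds the triangular filter
 * weights laid out on the Mel scale (precomputed elsewhere). */
void mel_filterbank(const double psd[FFT_LEN / 2 + 1],
                    const double W[NUM_MEL_BINS][FFT_LEN / 2 + 1],
                    double mel[NUM_MEL_BINS])
{
    for (int i = 0; i < NUM_MEL_BINS; i++) {
        mel[i] = 0.0;
        for (int k = 0; k <= FFT_LEN / 2; k++)
            mel[i] += W[i][k] * psd[k];
    }
}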

Log and DCT

The final step is to take the log and perform the Discrete Cosine Transform (DCT) to get the final MFCC output.

${{MFCC}(j)} = {\sum\limits_{i = 1}^{Q}{\log \left( {{{Mel}(i)}{\cos \left( {\frac{j\; \pi}{Q}\left( {i - 0.5} \right)} \right)}} \right)}}$

The rest of the analysis uses these MFCC values. In this invention, 24 MFCC coefficients were used.
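A C sketch of this final log-and-DCT step, producing the 24 coefficients used here (Q follows the assumed filterbank size above):

#include <math.h>

#define NUM_MFCC 24   /* 24 MFCC coefficients, as stated above */

/* MFCC(j) = sum_{i=1..Q} log(Mel(i)) * cos(j*pi/Q * (i - 0.5));
 * the 1-indexed Mel(i) of the equation maps to mel[i - 1] here. */
void mel_to_mfcc(const double mel[NUM_MEL_BINS], double mfcc[NUM_MFCC])
{
    for (int j = 0; j < NUM_MFCC; j++) {
        mfcc[j] = 0.0;
        for (int i = 1; i <= NUM_MEL_BINS; i++)
            mfcc[j] += log(mel[i - 1])
                     * cos(j * M_PI / NUM_MEL_BINS * (i - 0.5));
    }
}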

Training Process: Generating the Speaker Profile

The speaker profile is a representation of the speaker's acoustic space. In this invention, 64 clusters of Gaussian mixture models are used, built from user speech as depicted in FIG. 4. After going through the front end processing (step 1), the features are clustered using the k-means algorithm as follows:

(a) Initialization

64 vectors are randomly chosen as the initial centroids.

(b) Nearest Neighbour Search

Each of the training vectors is grouped with the centroid that has the closest Euclidean distance to it.

(c) Centroid Update

Within each cluster, a new mean is calculated taking into account all the vectors that have been grouped with it. The new mean is set as the new centroid.

Steps (b) and (c) are repeated until the distortion falls below a threshold.

Each of the clusters is then represented by a Gaussian mixture model.
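A compact C sketch of steps (a) through (c) follows, reusing the constants from the model sketch above; the fixed iteration cap standing in for the distortion-threshold test, and the random initialization being left to the caller, are simplifying assumptions.

#include <float.h>

/* Cluster T feature vectors into NUM_CLUSTERS groups. 'centroid' must
 * arrive initialized (step a: 64 randomly chosen vectors); 'assign'
 * receives the cluster index of each vector. */
void kmeans(const double (*x)[FEAT_DIM], int T,
            double centroid[NUM_CLUSTERS][FEAT_DIM],
            int *assign)
{
    for (int iter = 0; iter < 50; iter++) {
        /* Step b: group each vector with the centroid at the closest
         * (squared) Euclidean distance. */
        for (int t = 0; t < T; t++) {
            double best = DBL_MAX;
            for (int c = 0; c < NUM_CLUSTERS; c++) {
                double d = 0.0;
                for (int k = 0; k < FEAT_DIM; k++) {
                    double diff = x[t][k] - centroid[c][k];
                    d += diff * diff;
                }
                if (d < best) {
                    best = d;
                    assign[t] = c;
                }
            }
        }
        /* Step c: recompute each centroid as the mean of its group. */
        double sum[NUM_CLUSTERS][FEAT_DIM] = {{0.0}};
        int    cnt[NUM_CLUSTERS] = {0};
        for (int t = 0; t < T; t++) {
            cnt[assign[t]]++;
            for (int k = 0; k < FEAT_DIM; k++)
                sum[assign[t]][k] += x[t][k];
        }
        for (int c = 0; c < NUM_CLUSTERS; c++)
            if (cnt[c] > 0)
                for (int k = 0; k < FEAT_DIM; k++)
                    centroid[c][k] = sum[c][k] / cnt[c];
    }
}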

Testing/Decoding

FIG. 10 illustrates the decoding process per audio frame that has to be calculated per speaker model. The speaker with the maximum output value is chosen as the output speaker. Note that with this scheme, only the variable loglk is needed per speaker, as it represents the probability of the speech ending at the corresponding cluster. No information about the path sequence is needed. The output speaker can be decided at any point in time.

Automatic Training

When enough speech is obtained during the decoding process and the confidence level for choosing a particular output speaker is above a predetermined threshold, the speech samples are used to perform further adaptation of the speaker profile, following the training process described in step 2. The only difference is that the initial centroids are not random, but follow the current clusters.

Effects of the Invention

The invention provides a new method to reduce the complexity of a text independent speaker recognition system by reusing the structure of the text dependent version. Instead of comparing the input speech with a template path, only the probability of each cluster is calculated. The transition probability is substituted with two constants representing the transition to the same cluster and to a different cluster. With this invention, only the log likelihood (loglk) variable per cluster needs to be stored at each iteration, that is, whenever a new frame of speech is received. No other information needs to be retained for future processing, which results in memory savings. Contrary to conventional methods of text independent speaker recognition, which require enough speech samples before the system is able to make a decision, this invention enables the system to make the decision at any point in time by looking at the current loglk value. The confidence of the result can be set proportional to the difference in likelihood between the winning speaker and the next highest result. When confidence is high, the speech is automatically used to improve the speaker profile by performing a further training process.

FIG. 11 is a functional block diagram of an electronic device 800 including speech-recognition circuitry 802 contained in processing circuitry 804 according to one embodiment of the present disclosure. The speech recognition circuitry 802 corresponds to circuitry and/or software that executes the speaker dependent voice recognition algorithms described above with reference to FIGS. 1-7. The processing circuitry 804 may be any suitable processing circuitry, such as a microprocessor where the electronic device 800 is a personal computer or an applications processor where the electronic device is a smartphone or tablet computer. Similarly, the touch controller 806 may include any suitable digital and/or analog circuitry to perform the desired functions of the controller.

The electronic device 800 includes a touch controller 806 that detects the presence of touches or touch points P(X,Y) and gestures including such touch points on a touch screen 808 that is coupled to the controller. The touch screen 808 has a number of touch sensors 810 positioned on the touch screen to detect touch points P(X,Y), with only three touch sensors being shown merely to simplify the figure. The touch controller 806 controls the touch screen 808 to detect a user's finger, stylus, or any other suitable device, all of which will collectively be referred to as a “user device” herein. The detection of the user device at a particular location on the touch screen 808 is defined as a touch point P(X,Y) on the touch screen. An X-axis and Y-axis are shown in FIG. 11, with the X coordinate of a touch point P(X,Y) corresponding to a point along the X-axis and the Y coordinate to a point along the Y-axis. The touch sensors 810 generate corresponding sensor signals responsive to a touch point P(X,Y) and provide these signals to the touch controller 806 for processing. The touch sensors 810 are typically contained in some sort of transparent sensor array that is part of the touch screen 808, the detailed structure of which is understood by those skilled in the art and thus will not be described herein. The number and location of the touch sensors 810 can vary, as can the particular type of sensor, such as ultrasonic, resistive, vibration, or capacitive sensors.

The processing circuitry 804 is coupled to the touch controller 806 and is operable to execute applications or “apps” 812 designed to perform a specific function or provide a specific service on the electronic device 800. Where the electronic device 800 is a cellular phone or a tablet computer, for example, the applications 812 can include a wide variety of different types of applications, such as music applications, email applications, video applications, game applications, weather applications, reader applications, and so on. The touch controller 806 reports touch information to the applications 812, which operate in response thereto to control operation of the application and/or the electronic device 800.

The electronic device 800 can be any kind of suitable electronic device or system. The device 800 need not include the touch screen 808 and can include additional components not expressly illustrated in FIG. 11. For example, the electronic device 800 could be a personal computer system, desktop or laptop, a television, a home-theater system, a smart appliance, a vehicle such as a car or truck where the algorithm is used in lieu of a key to access, activate, and deactivate the vehicle, a security system that provides or denies the speaker access to a facility, and so on.

In one embodiment, the electronic device 800 operates in a sleep or low-power mode of operation and the speaker dependent voice recognition algorithm executes during this mode to detect the utterance of the code phrase by an authorized user or users. The low-power mode is a mode of operation that is common in electronic devices in which at least some of the electronic components in the device are powered down or placed in an alternate state to reduce the power consumption of these components and thereby reduce the overall power consumption of the electronic device. In response to detecting the code phrase, the electronic device 800 is then “activated” or leaves the low-power mode of operation. For example, where the device 800 is a smart phone, when the algorithm detects the utterance of the code phrase by an authorized user, the home screen or some other screen is then displayed to give the speaker access to and allow him or her to operate the device.

One skilled in the art will understand that even though various embodiments and advantages have been set forth in the foregoing description, the above disclosure is illustrative only, and changes may be made in detail and yet remain within the broad principles of the embodiments. For example, some of the components described above may be implemented using either digital or analog circuitry, or a combination of both, and also, where appropriate, may be realized through software executing on suitable processing circuitry. The code phrase may be in any language, not just English, and could even be gibberish, a random sequence of sounds, or a sound defined by the user. Therefore, the present disclosure is to be limited only as defined by the appended claims and any such later-introduced claims supported by the present disclosure.

What is claimed is:
 1. A method of text independent speaker recognition where the speaker profile is represented by clustered features extracted from user speech data during a training session.
 2. The method of claim 1 where each of the clustered features is represented by a Gaussian mixture model, denoting a smaller unit of speech.
 3. The method of claim 1 where, for each frame of incoming speech, a log likelihood (loglk) is calculated for each cluster to find the probability that the speech is ending at that cluster.
 4. The method of claim 3 where the probability of speech ending at that cluster takes into account the probability of the cluster that precedes it and the transition probability to the current cluster.
 5. The method of claim 4 where the transition probability between clusters is set lower than the transition probability to itself.
 6. The method of claim 4 where the transition probability between clusters is set at 0.05 and the transition probability to itself is set at 0.95.
 7. The method of claim 1 where the output decision can be taken at any point in time by evaluating the highest value of the loglk variable among the speakers.
 8. The method of claim 1 where the output decision confidence value can be calculated by evaluating the difference between the loglk variables of the two highest scoring speakers.
 9. The method of claim 8 where the output confidence value can be set as a threshold to accept or reject a speaker, and the threshold value is based on the desired reliability of the system.
 10. The method of claim 3 where the value of loglk is reset to zero upon encountering a period of silence in the input speech.
 11. The method of claim 3 where the value of loglk is reset, upon encountering a period of silence in the input speech, to a value that corresponds to the speaker change probability according to a conversation state model.
 12. The method of claim 1 where the GMM of the clustered features is continuously updated using speech frames that have a high confidence value of belonging to the corresponding speaker.